perm filename TTT[4,KMC]1 blob sn#021293 filedate 1973-01-24 generic text, type T, neo UTF8
00100		HOW TO USE AND HOW NOT TO USE TURING-LIKE TESTS
00200	            IN EVALUATING THE ADEQUACY OF SIMULATION MODELS
00300	       K.M. COLBY AND F.D. HILF
00400	
00500		It is very easy to become confused about  Turing's  imitation
00600	game.    In part this is due to Turing himself when in his 1950 paper
00700	entitled COMPUTING  MACHINERY  AND  INTELLIGENCE  he  introduced  his
00800	imitation  game [3]. A careful reading of this paper reveals there are
00900	actually two games proposed, the second of which is commonly  called
01000	Turing's test.
01100		In the first imitation game  two  groups  of  judges,
01200	using   teletyped   interviews,   try   to  determine  which  of  two
01300	interviewees is a woman. Each judge is initially informed that o[...]
[...]e rejected, especially since
09200	statistical   tests  are  biased  in  favor  of  rejecting  the  null
09300	hypothesis [2]. Yet this answer does not tell us what we would  most
09400	like to know, i.e. how to improve the model. Simulation models do not
09500	spring forth in a complete, final and zero-defect form; they must  be
09600	gradually  developed over time. Perhaps we might obtain a "yes" answer
09700	to the machine-question if we allowed a large number of expert judges
09800	to conduct the interviews themselves rather than studying transcripts
09900	of other interviewers.   Such an answer would indicate that the model must be
10000	improved  but  unless  we  systematically investigated how the judges
10100	succeeded in making the discrimination we would not know what aspects
10200	of  the  model to work on. The logistics of such a design are immense
10300	and obtaining a large N of judges  for  sound  statistical  inference
10400	would require an effort disproportionate to the information-yield.
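
The cost of a large N can be made concrete with a rough calculation. The sketch below assumes a normal-approximation confidence interval around a chance-level proportion of 0.5; the function name `judges_needed` and the chosen half-widths are illustrative assumptions, not figures from the study.

```python
from math import ceil


def judges_needed(half_width, p=0.5, z=1.96):
    """Approximate number of judges needed so that a confidence interval
    for a proportion near p has the given half-width.

    Assumes the normal approximation; z=1.96 corresponds to 95% confidence.
    """
    return ceil(z * z * p * (1 - p) / half_width ** 2)


# Hypothetical precision targets: plus or minus 10 points and 5 points.
for half_width in (0.10, 0.05):
    print(f"half-width {half_width:.2f}: about {judges_needed(half_width)} judges")
```

Halving the half-width roughly quadruples the number of judges, which is one way to see why the information-yield of the machine-question is poor relative to the effort.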
10500		A more efficient and informative way to use Turing-like tests
10600	is  to ask judges to make ordinal ratings along scaled dimensions from
10700	teletyped interviews.    We  shall  term  this  approach  asking  the
10800	dimension-question.   One can then compare scaled ratings received by
10900	the patients and by the model to precisely determine where and by how much they
11000	differ.        Model   builders   strive  for  a  model  which  shows
11100	indistinguishability along  some  dimensions  and  distinguishability
11200	along  others.  That is, the model converges on what it is supposed to
11300	simulate and diverges from that which it is not.
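
A comparison of this kind might be sketched as follows. The ratings and the two dimension names used here are invented purely for illustration (they are not data from the study), and the helper `mean_differences` is a hypothetical name; the point is only that per-dimension differences show both where and by how much the model departs from the patients.

```python
from statistics import mean

# Hypothetical ratings for illustration only; NOT data from the study.
ratings = {
    "mistrust": {"patients": [4, 5, 3, 6], "model": [6, 7, 5, 8]},
    "fear": {"patients": [3, 4, 2, 3], "model": [3, 3, 4, 2]},
}


def mean_differences(ratings):
    """Return the model-minus-patients mean rating for each dimension."""
    return {
        dim: mean(groups["model"]) - mean(groups["patients"])
        for dim, groups in ratings.items()
    }


differences = mean_differences(ratings)

# List dimensions from the largest absolute difference to the smallest.
for dim, diff in sorted(differences.items(), key=lambda kv: -abs(kv[1])):
    print(f"{dim:10s} model - patients = {diff:+.2f}")
```

A difference near zero suggests indistinguishability along that dimension; a large difference points to an aspect of the model that deserves attention.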
11400		We   mailed   paired-interview  transcripts  to  another  100
11500	randomly selected psychiatrists asking them to rate the responses  of
11600	the  two `patients' along certain dimensions. The judges were divided
11700	into groups, each judge being asked to rate  responses  of  each  I-O
11800	pair in the interviews along  four  dimensions. The total number of dimensions in this
11900	test was twelve: linguistic  noncomprehension,  thought  disorder,
12000	organic brain syndrome, bizarreness, anger, fear, ideas of reference,
12100	delusions, mistrust, depression, suspiciousness and mania. These  are
12200	dimensions which psychiatrists commonly use in evaluating patients.
12300		Table 1
12400		Table  1  shows  there were significant differences, with the
12500	model receiving higher  scores  along  the  dimensions  of  linguistic
12600	noncomprehension, bizarreness, anger, mistrust and suspiciousness. On
12700	the dimension of delusions the patients were rated higher. There were
12800	no  significant  differences  along  the  dimensions of organic brain
12900	syndrome, fear, ideas of reference, depression and mania.
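
The text does not state which significance test lies behind Table 1. As one possibility, and only as an assumption, the sketch below applies an exact two-sided permutation test to the difference in mean ratings on a single dimension. The ratings are invented for illustration and `permutation_p_value` is a hypothetical helper name.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference of group means.

    Enumerates every way of splitting the pooled ratings into groups of
    the original sizes and counts the splits whose absolute mean
    difference is at least as large as the observed one.
    """
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    indices = range(len(pooled))
    total = 0
    extreme = 0
    for chosen in combinations(indices, len(a)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen_set]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical ratings on one dimension; NOT data from the study.
model_ratings = [6, 7, 5, 8]
patient_ratings = [4, 5, 3, 6]
p_value = permutation_p_value(model_ratings, patient_ratings)
print(f"two-sided permutation p-value: {p_value:.3f}")
```

Exhaustive enumeration is practical only for small groups; with the numbers of judges reported here, a random sample of permutations or a rank-based test would be the more usual choice.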
13000		While    tests    asking    the   machine-question   indicate
13100	indistinguishability at  the  gross  level,  a  study  of  the  finer
13200	structure  of  the  model's  behavior  through  ratings  along scaled
13300	dimensions  shows  statistically  significant   differences   between
13400	patients  and  model.   These  differences  are  of help to the model
13500	builder in suggesting which aspects of the model must be modified and
13600	improved in order for it to be considered an adequate simulation of the
13700	class of paranoid patients it is intended to simulate.  For  example,
13800	it  is  clear  that  the  language-comprehension of the model must be
13900	improved.  Once this has been implemented, a future test will tell us
14000	whether improvement has occurred and by how much in comparison to the
14100	earlier version.   Successive identification of particular  areas  of
14200	failure in the model permits their improvement and the development of
14300	more adequate model-versions.
14400		Further  evidence that the machine-question is an insensitive
14500	test appears in Table 2. In this test we constructed a random version
14600	of  the  paranoid  model  which utilized the output statements of the
14700	original model  but  expressed  them  randomly  no  matter  what  the
14800	interviewer  said.   Two psychiatrists conducted interviews with this
14900	model, transcripts of which were paired with patient  interviews  and
15000	sent   to   200  randomly  selected  psychiatrists  asking  both  the
15100	machine-question and the dimension-question. Replies were so  few  to
15200	the  first  mailing of 100 that another mailing was needed to achieve
15300	the required N - another fact to ponder.  Of the 69 replies, 34 (49%)
15400	were  right  and  35  (51%)  wrong. Based on this random sample of 69
15500	psychiatrists we are 95% confident that between 39% and  63%  of  all
15600	psychiatrists could make the correct identification, again indicating
15700	a chance level.  However, as shown in Table 2, definite  differences
15800	appear  along  the dimensions of linguistic noncomprehension, thought
15900	disorder (get other dimensions  from  table).   On  these  particular
16000	dimensions  we  can construct a continuum in which the random version
16100	represents one extreme, the actual patients another.  Our (nonrandom)
16200	model  lies  somewhere between these two extremes, indicating that it
16300	performs significantly better  than  the  random  version  but  still
16400	requires improvement before being indistinguishable from patients. In
16500	other words, this approach provides  yardsticks for measuring the adequacy of this or
16600	any other dialogue simulation model along the relevant dimensions.
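
The chance-level reading of the 69 replies can be checked with a short calculation. The sketch below assumes the normal-approximation (Wald) interval for a proportion; the text does not say which interval method was used, so the bounds it prints need not agree to the percentage point with those quoted above. The function name `wald_interval` is illustrative.

```python
from math import sqrt


def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    z=1.96 corresponds to 95% confidence. Bounds are clipped to [0, 1].
    """
    p = successes / n
    half_width = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# 34 correct identifications out of 69 replies, as reported in the text.
low, high = wald_interval(34, 69)
contains_chance = low <= 0.5 <= high
print(f"95% interval: {low:.1%} to {high:.1%}; includes 50%: {contains_chance}")
```

Whatever the exact method, an interval that comfortably contains 50% supports the same conclusion: the judges' identifications cannot be distinguished from guessing.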
16700		We conclude that when model builders want  to  conduct  tests
16800	which  indicate  in  which  direction  progress  lies and to obtain a
16900	measure of whether  progress  is  being  achieved,  the  way  to  use
17000	Turing-like  tests  is  to  ask  expert  judges to make ratings along
17100	multiple dimensions considered essential to the model.  Useful  tests
17200	do  not  prove  a  model; they probe it for its sensitivities. Simply
17300	asking the machine-question yields  no  information  about  improving
17400	what  the model builder knows is only a first approximation. His main
17500	problem is then  how to get on with it.
17600	
17700	
17800		REFERENCES
18000	[1] Colby, K.M., Hilf, F.D., Weber, S. and Kraemer, H.C. Turing-like
18100	    indistinguishability tests for the validation of a computer
18200	    simulation of paranoid processes. ARTIFICIAL INTELLIGENCE, 3
18300	    (1972), 199-221.
18400	[2] Meehl, P.E. Theory testing in psychology and physics: a
18500	    methodological paradox. PHILOSOPHY OF SCIENCE, 34 (1967), 103-115.
18800	[3] Turing, A. Computing machinery and intelligence. Reprinted in:
18900	    COMPUTERS AND THOUGHT (Feigenbaum, E.A. and Feldman, J., eds.).
19000	    McGraw-Hill, New York, 1963, pp. 11-35.